A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization
Abstract
BACKGROUND In applications of supervised statistical learning in the biomedical field, it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the entire dataset before training/test-set-based prediction error estimation by cross-validation (CV), an approach referred to as "incomplete CV". Whether incomplete CV results in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis (PCA) for dimension reduction of the covariate space. Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance, and imputation of missing values.

METHODS We devise the easily interpretable and general measure CVIIM ("CV Incompleteness Impact Measure") to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure would be acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA.

RESULTS Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings.

CONCLUSIONS While the investigated forms of normalization can be safely performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias.
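
The distinction between full and incomplete CV can be illustrated with a minimal sketch that uses only the Python standard library. Everything in it is an assumption made for illustration and not taken from the paper: per-feature standardization stands in for a generic data preparation step (it is not microarray normalization or PCA), a nearest-centroid rule stands in for the classifier, and the names `cv_error`, `make_toy_data` and `relative_error_reduction` are hypothetical. The abstract does not state the formula for CVIIM, so the last function shows only one plausible way to express the bias as a relative reduction in estimated error.

```python
import random
import statistics


def make_toy_data(n_per_class=30, n_features=5, seed=0):
    """Hypothetical two-class toy data with a small mean shift between classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label, shift in ((0, 0.0), (1, 0.5)):
        for _ in range(n_per_class):
            X.append([rng.gauss(shift, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_scaler(rows):
    """Per-feature mean and standard deviation (sd 0 is replaced by 1)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, sds


def apply_scaler(rows, scaler):
    means, sds = scaler
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def fit_centroids(X, y):
    """Nearest-centroid 'training': one mean vector per class label."""
    centroids = {}
    for label in set(y):
        members = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [statistics.fmean(c) for c in zip(*members)]
    return centroids


def predict(centroids, x):
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))


def cv_error(X, y, k=5, complete=True, seed=0):
    """Misclassification rate estimated by k-fold CV.

    complete=False: the scaler is fitted once on ALL rows (incomplete CV).
    complete=True:  the scaler is refitted on the training rows of each fold.
    """
    if not complete:
        X = apply_scaler(X, fit_scaler(X))  # test rows influence the scaler
    errors = 0
    for test_idx in kfold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        if complete:
            scaler = fit_scaler(X_tr)  # fitted on training rows only
            X_tr, X_te = apply_scaler(X_tr, scaler), apply_scaler(X_te, scaler)
        centroids = fit_centroids(X_tr, y_tr)
        errors += sum(predict(centroids, x) != t for x, t in zip(X_te, y_te))
    return errors / len(X)


def relative_error_reduction(e_full, e_incomplete):
    """Assumed illustrative bias measure: share of the full-CV error that
    disappears under incomplete CV; 0 when there is no reduction."""
    if e_full > 0 and e_incomplete < e_full:
        return 1 - e_incomplete / e_full
    return 0.0
```

Calling `cv_error(X, y, complete=True)` and `cv_error(X, y, complete=False)` on the same data and passing both results to `relative_error_reduction` gives a single number per dataset. For a simple step such as standardization the two estimates are expected to be close, which is in line with the abstract's finding for normalization; a step such as PCA would have to be refitted inside the loop in the same way as the scaler.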